Geo-Visual Analytics: Site Selection for Starbucks

Based on Giulia Carella's Notebook presented in:

SDSC20 Workshop - Integrating CARTOframes into Spatial Data Science workflows. October 21 2020

Practice goals:

In this practice we will be a Spatial Data Science in a geo-planning department. We will lead a project for site selection to our company, Starbucks. Site selections refers to the process of deciding where to open a new or relocate an exiting store/facility by comparing the merits of potential locations. In previous practices, we learnt the concept of potential as the total amount of expense/revenues that an entity (e.g. a company, shop, etc...) can generate. If some attribute of interest is known (e.g. the revenues of existing stores) statistical and machine learning methods can be used to help with the decision process.

In this practice, we will support our company to look at where Starbucks should open new coffee shops in Long Island, NY. Specifically, we will build a model for the revenues of the existing stores as a funcion of sociodemographic data and use this model to predict the expected revenue in each block group.

We will:

  • Read and upload to CARTO the revenue data
  • Geocode the store adresses (i.e. convert addresses to coordinates)
  • Download for Long Island the US Census data by block group
  • Enrich each store location with the corresponding block group level Census data
  • Build a predictive model for the store revenues
  • Publish a visualization with the results

0. Context: Site Selection

0.1 Set up the environment

First of all, install virtualenv inside your project folder to secure your project directory to avoid conflict with your other packages.

pip install virtualenv

After installing this run this command one by one inside your root project directory:

virtualenv venv

source venv/bin/activate

Now Your directory is secure and you can install your required packages inside.

pip install -r requirements.txt

(In case of error when installing Rtree package, please, use the following instruction in your directory: pip install "rtree>=0.8,<0.9"

To exit the virtual environment type:

deactivate

0.2 Imports

In [294]:
from requirements import *
from ppca import PPCA
from utils import *

0.3 Set CARTO credentials

Access to you Carto account (www.Carto.com) and access to your CARTO dashboard (with your login and password) and click on API KEY to get your key.

In [295]:
carto_username = '*'
carto_API = '*'

set_default_credentials(username=carto_username, api_key=carto_API)

0.4. Dataset: Upload Starbucks store data to your CARTO account

First we upload the Starbucks data that are stored in your local in the starbucks_long_island.csv file. The data contains the addresses of Starbucks stores, their annual revenue ($) and some store characteristics (e.g. the number of open hours per week).

In [296]:
stores = pd.read_csv('./data/starbucks_long_island.csv')
stores.head(3)
Out[296]:
name addresslines hours_open_per_week gc_status_rel has_lu has_cl has_wa has_vs zipcode storenumber revenue
0 7000 Austin Street 7000 Austin Street,Forest Hills, NY 11375 108.0 0.97 1 1 1 1 11375 7217-623 1.016898e+06
1 107-12 Continental Avenue 107-12 Continental Avenue,Forest Hills, NY 11375 116.5 0.95 1 0 1 0 11375 7349-910 4.470646e+05
2 Stop & Shop-Forest Hills #539 8989 Union Turnpike,Forest Hills, NY 11375 112.0 0.97 0 0 1 0 11375 73708-104729 7.028071e+05

1. Data understanding

[EX 1] Explore the dataset to identify the number of registers or samples, variables, nulls and type of variables

In [297]:
stores.info(memory_usage=False)
print('\n' * 2)
print('The dataset has 177 entries with 11 different features each.')
print('All the variables are null-free.')
print('All the variables except: Name, Addresslines and StoreNumber are recognised as numbers (either int or float.)')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 177 entries, 0 to 176
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   name                 177 non-null    object 
 1   addresslines         177 non-null    object 
 2   hours_open_per_week  177 non-null    float64
 3   gc_status_rel        177 non-null    float64
 4   has_lu               177 non-null    int64  
 5   has_cl               177 non-null    int64  
 6   has_wa               177 non-null    int64  
 7   has_vs               177 non-null    int64  
 8   zipcode              177 non-null    int64  
 9   storenumber          177 non-null    object 
 10  revenue              177 non-null    float64
dtypes: float64(3), int64(5), object(3)


The dataset has 177 entries with 11 different features each.
All the variables are null-free.
All the variables except: Name, Addresslines and StoreNumber are recognised as numbers (either int or float.)

[EX 2] Plot the distribution and the boxplot of the annual revenues by store

In [298]:
fig = plt.figure(figsize=(5, 5))
  
fig,axes=plt.subplots(1,2, figsize=(15, 5))
sns.distplot(stores['revenue'], color="lightblue", ax=axes[0])
sns.boxplot(data=stores, y= 'revenue', orient='v', ax=axes[1])
plt.show()
<Figure size 360x360 with 0 Axes>

[EX 3] Which other exploratory analysis do you propose to have a clear understanding of the data?

In [299]:
list_binary_varibables = []
for col in stores.columns: 
    if len(stores[col].unique()) == 2:
        list_binary_varibables.append(col)

list_binary_varibables
Out[299]:
['has_lu', 'has_cl', 'has_wa', 'has_vs']
In [300]:
#for variables with only two values, we plot countplots
for var in list_binary_varibables:
    ax = sns.countplot(x=var, data=stores)
    plt.show()
In [301]:
#for the other numerical variables, we plot a histogram and a boxplot

for var in ['hours_open_per_week', 'gc_status_rel']:
    fig = plt.figure(figsize=(5, 5))
  
    fig,axes=plt.subplots(1,2, figsize=(15, 5))
    sns.distplot(stores[var], color="lightblue", ax=axes[0])
    sns.boxplot(data=stores, y= var, orient='v', ax=axes[1])
    plt.show()
<Figure size 360x360 with 0 Axes>
<Figure size 360x360 with 0 Axes>

2. Spatial analytics: Geocode locations and visualization

Since we only know the addresses, we first need to geocode them to extract the corresponding geographical locations, which will be used to assign to each store the Census data of the corresponding block group.

One of the key services that Cartoframes provides in geocoding through the library: from cartoframes.data.services import Geocoding

Tip: You will find Cartoframes' API description in this link: https://carto.com/developers/cartoframes/reference/#heading-Data-Services

[EX 4] Apply the Cartoframe's geocoding service (i.e. Geocoding()) to geocode from addresslines variable in our dataframe. Store the output into the storesdataframe.

In [302]:
#stores = Geocoding().geocode(stores, 'addresslines').data
In [303]:
stores = gpd.read_file('./data/table_4c455fc710.geojson')

Done! Now that the stores are geocoded, you will notice a new column named geometry has been added. This column stores the geographic location of each store and it’s used to plot each location on the map.

[EX 5] Which is the name of the column with the geographic location? What kind of spatial data object is?

In [304]:
print('The name of the column with the geographic location is \' the_geo\' and it is a geometry object, a POINT.')
stores.head(3)
The name of the column with the geographic location is ' the_geo' and it is a geometry object, a POINT.
Out[304]:
cartodb_id name addresslines hours_open_per_week gc_status_rel has_lu has_cl has_wa has_vs zipcode storenumber revenue carto_geocode_hash geometry
0 1 7000 Austin Street 7000 Austin Street,Forest Hills, NY 11375 108.0 0.97 1 1 1 1 11375 7217-623 1.016898e+06 2d576a590508cca91deb041deeb010d0 POINT (-73.84791 40.72111)
1 2 107-12 Continental Avenue 107-12 Continental Avenue,Forest Hills, NY 11375 116.5 0.95 1 0 1 0 11375 7349-910 4.470646e+05 98bcb148c9fbfc600eebd767c0b92dd1 POINT (-73.84563 40.71873)
2 3 Stop & Shop-Forest Hills #539 8989 Union Turnpike,Forest Hills, NY 11375 112.0 0.97 0 0 1 0 11375 73708-104729 7.028071e+05 bebcdc97f8a0dd3dfe28a189e31e375c POINT (-73.85552 40.70676)

[EX 6] Let's create our first visualization using Cartoframe's Map and Layer classes. Use Map(Layer(dataframe))to build a map.

Tip: Check out the Visualization guide to learn more about the visualization capabilities inside of CARTOframes.

In [305]:
Map(Layer(stores))
Out[305]:

[EX 7] Congrats! You have done your first map visualization. Now, create a new map with the following characteristics (see this link: https://carto.com/developers/cartoframes/reference/#heading-Map for further information about parameters of Map(Layer()):

  • the size of every shop depends on its revenues categorized in 6 ranges: <400000, 400K-700K, 700K-1M, 1M-1.3M, 1.3M-1.6M and over 1.6M
  • with a legend with title='Annual Revenue', description='STARBUCKS', footer ='Size of the blob'
  • plot' size of (920, 400)
  • show_info = True
  • center at 'latittude': 40.646891, 'longitude': -73.869378 and zoom of 10

Tip: You may use the function size_bins_style from Cartoframes. Introduce help(size_bins_style) or help(size_bins_legend) in the notebook to get information about these functions

In [306]:
plot_size = (920, 400)

breaks = [400000,700000, 1000000, 1300000, 1600000]
size_bins_st = size_bins_style('revenue', breaks = breaks)
size_bins_leg = size_bins_legend(title='Annual Revenue', description='STARBUCKS', footer='Size of the blob')

viewport = {'zoom': 10, 'lat':40.646891, 'lng': -73.869378}

Map(Layer(stores, size_bins_st, size_bins_leg), size = plot_size, viewport = viewport,show_info=True)
Out[306]:

Upload data to CARTO

Carto gives us the option to upload the map or dashboard to our repository in Carto's cloud as follows:

In [307]:
to_carto(stores,'starbucks_long_island_geocoded', if_exists = 'replace')
Success! Data uploaded to table "starbucks_long_island_geocoded" correctly
Out[307]:
'starbucks_long_island_geocoded'

3. Enrichment: Download demographic and socioeconomic data by block group

Next, we will download from CARTO DATA Observatory (www.carto.com/data) the demographic and socioeconomic variables that we will use to build a model for the stores revenues. We will use data from the The American Community Survey (ACS), which is an ongoing survey by the U.S. Census Bureau. The ACS is available at the most granular resolution at the census block group level, with the most recent geography boundaries available for 2015. The data that we will use are from the survey run from 2013 to 2017.

Upload Long Island geometry

As we are interested only in the data for the Long Island area, we first download the geographical boundaries of this area from the manhattan_long_island.geojson file, which is stored locally.

In [308]:
import geopandas as gpd
aoi = gpd.read_file('./data/manhattan_long_island.geojson')

If we represent the aoi.head() we obtain:

[EX 8] Explore the geometry feature. What kind of spatial object is?

In [309]:
print('The spatial object is a polygon.')
aoi.info()
aoi.head()
The spatial object is a polygon.
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   cartodb_id  1 non-null      int64   
 1   geometry    1 non-null      geometry
dtypes: geometry(1), int64(1)
memory usage: 144.0 bytes
Out[309]:
cartodb_id geometry
0 1 POLYGON ((-74.00617 40.69951, -74.00860 40.690...

[EX 9] Create a map layer to represent the Long Island area with the following characteristics:

  • basic_style (color = '#eaff00', stroke_color = '#eaff00', stroke_width = 2, opacity = 0.2)
  • show_info: True
  • viewport: zomm=7.5, lattitute= 40.845419 and longitude= -72.841376
  • size: (920, 400)
In [310]:
viewport = {'zoom': 7.5, 'lat':40.845419, 'lng': -72.841376}

plot_size = (920, 400)



basicstyle = cartoframes.viz.basic_style(color = '#eaff00', stroke_color = '#eaff00', stroke_width = 2, opacity = 0.2)

Map(Layer(aoi, basicstyle),  size = plot_size, viewport = viewport,show_info=True)
Out[310]:

Look for the datasets of interest using the Data Discovery methods

More documentation here: https://carto.com/developers/cartoframes/guides/Data-discovery/

Let's start by exploring the Data Catalog to see what data categories CARTO can offer for data in the US.

In [311]:
Catalog().country('usa').categories
Out[311]:
[<Category.get('behavioral')>,
 <Category.get('covid19')>,
 <Category.get('demographics')>,
 <Category.get('derived')>,
 <Category.get('environmental')>,
 <Category.get('financial')>,
 <Category.get('housing')>,
 <Category.get('human_mobility')>,
 <Category.get('points_of_interest')>,
 <Category.get('road_traffic')>]

Then, for demographics, let's check the available providers and geometries and select the most recent ACS data at the block group level

In [312]:
Catalog().country('usa').category('demographics').datasets.to_dataframe().provider_id.unique()
Out[312]:
array(['ags', 'worldpop', 'experian', 'usa_acs', 'usa_bls', 'mbi',
       'gbr_cdrc'], dtype=object)

Let's select usa_acs provider from the previous ones and check the catalogue of enriched data available.

In [313]:
catalog_usa_demo = Catalog().country('usa').category('demographics').datasets.to_dataframe()
catalog_usa_demo_acs = catalog_usa_demo[catalog_usa_demo.provider_id == 'usa_acs']
catalog_usa_demo_acs.geography_name.unique()
Out[313]:
array(['Census Tract - United States of America (2015)',
       'Core-based Statistical Area - United States of America (2019)',
       'Census Place - United States of America (2019)',
       'Congressional District - United States of America (2019)',
       'State - United States of America (2015)',
       'School District (unified) - United States of America (2019)',
       'Public Use Microdata Area - United States of America (2019)',
       'County - United States of America (2015)',
       'School District (secondary) - United States of America (2019)',
       'School District (elementary) - United States of America (2019)',
       '5-digit Zip Code Tabulation Area - United States of America (2015)',
       'Census Block Group - United States of America (2015)'],
      dtype=object)

Now, let's select the data from 'Census Block Group - United States of America (2015)'

In [314]:
catalog_usa_demo_acs_bk = catalog_usa_demo_acs[catalog_usa_demo_acs.geography_name == 'Census Block Group - United States of America (2015)']

As this Census Block Group can have different tables we should list and select the most recent one. So if we list all tables:

In [315]:
catalog_usa_demo_acs_bk.id.to_list()
Out[315]:
['carto-do-public-data.usa_acs.demographics_sociodemographics_usa_blockgroup_2015_5yrs_20132017',
 'carto-do-public-data.usa_acs.demographics_sociodemographics_usa_blockgroup_2015_5yrs_20142018',
 'carto-do-public-data.usa_acs.demographics_sociodemographics_usa_blockgroup_2015_5yrs_20122016',
 'carto-do-public-data.usa_acs.demographics_sociodemographics_usa_blockgroup_2015_5yrs_20082012',
 'carto-do-public-data.usa_acs.demographics_sociodemographics_usa_blockgroup_2015_5yrs_20072011',
 'carto-do-public-data.usa_acs.demographics_sociodemographics_usa_blockgroup_2015_5yrs_20112015',
 'carto-do-public-data.usa_acs.demographics_sociodemographics_usa_blockgroup_2015_5yrs_20102014',
 'carto-do-public-data.usa_acs.demographics_sociodemographics_usa_blockgroup_2015_5yrs_20092013',
 'carto-do-public-data.usa_acs.demographics_sociodemographics_usa_blockgroup_2015_5yrs_20062010']

To explore in greater detail the metadata associated to this table, we can use the Datasetcommand from Carto as follows:

In [316]:
Dataset.get('carto-do-public-data.usa_acs.demographics_sociodemographics_usa_blockgroup_2015_5yrs_20132017').to_dict()
Out[316]:
{'slug': 'acs_sociodemogr_b758e778',
 'name': 'Sociodemographics - United States of America (Census Block Group, 2017, 5yrs)',
 'description': 'The American Community Survey (ACS) is an ongoing survey that provides vital information on a yearly basis about the USA and its people. This dataset contains only a subset of the variables that have been deemed most relevant. More info: https://www.census.gov/programs-surveys/acs/about.html',
 'category_id': 'demographics',
 'country_id': 'usa',
 'data_source_id': 'sociodemographics',
 'provider_id': 'usa_acs',
 'geography_name': 'Census Block Group - United States of America (2015)',
 'geography_description': 'Shoreline clipped TIGER/Line boundaries. More info: https://carto.com/blog/tiger-shoreline-clip/',
 'temporal_aggregation': '5yrs',
 'time_coverage': '[2013-01-01, 2018-01-01)',
 'update_frequency': 'yearly',
 'is_public_data': True,
 'lang': 'eng',
 'version': '20132017',
 'category_name': 'Demographics',
 'provider_name': 'American Community Survey',
 'geography_id': 'carto-do-public-data.carto.geography_usa_blockgroup_2015',
 'id': 'carto-do-public-data.usa_acs.demographics_sociodemographics_usa_blockgroup_2015_5yrs_20132017'}

[EX10] According to the data description which is the main characteristics of the census block data in terms of time aggregation, private or public, updating periodicity, etc....?

In [317]:
print('This census block data has a temporal aggregation of 5 years, from 2013 to 2018 (not included). The data is public. ')
This census block data has a temporal aggregation of 5 years, from 2013 to 2018 (not included). The data is public. 

If we do the same exploration for the geometries associated to the selected census block, we should use the Geography method from Carto as follows:

In [318]:
Geography.get('carto-do-public-data.carto.geography_usa_blockgroup_2015').to_dict()
Out[318]:
{'slug': 'cdb_blockgroup_7753dd51',
 'name': 'Census Block Group - United States of America (2015)',
 'description': 'Shoreline clipped TIGER/Line boundaries. More info: https://carto.com/blog/tiger-shoreline-clip/',
 'country_id': 'usa',
 'provider_id': 'carto',
 'geom_type': 'MULTIPOLYGON',
 'update_frequency': None,
 'is_public_data': True,
 'lang': 'eng',
 'version': '2015',
 'provider_name': 'CARTO',
 'id': 'carto-do-public-data.carto.geography_usa_blockgroup_2015'}

Finally, we will build the query to get the Census data and Census geometries of the polygons that interesects with the Long Island geometry previously defined. To do it, we apply the following code:

In [319]:
df_X = Dataset.get('acs_sociodemogr_b758e778')
q = "select * from $dataset$ where ST_Intersects(geom,st_geogfromtext('{}'))".format(aoi.geometry.astype(str)[0])
df_X = df_X.to_dataframe(sql_query=q)
df_X['geometry'] = df_X['geom'].apply(lambda x: str_to_geom(x))
df_X = GeoDataFrame(df_X,geometry = df_X.geometry)
df_X.crs = {'init' :'epsg:4326'}
df_X.to_crs(epsg=4326, inplace=True)
df_X.head(3)
Out[319]:
geoid do_date total_pop households male_pop female_pop median_age male_under_5 male_5_to_9 male_10_to_14 ... pop_in_labor_force not_in_labor_force armed_forces civilian_labor_force do_label do_perimeter do_area do_num_vertices geom geometry
0 360470340002 2013-01-01 720.0 139.0 298.0 422.0 83.1 0.0 0.0 0.0 ... 0.0 720.0 0.0 0.0 Block Group 2 909.463 47648.525 9 POLYGON((-73.994726 40.57152, -73.993577 40.57... POLYGON ((-73.99473 40.57152, -73.99358 40.571...
1 360471058042 2013-01-01 784.0 586.0 316.0 468.0 80.0 0.0 0.0 0.0 ... 14.0 770.0 0.0 14.0 Block Group 2 1206.945 89274.575 9 POLYGON((-73.889591 40.651338, -73.889994 40.6... POLYGON ((-73.88959 40.65134, -73.88999 40.651...
2 360810085002 2013-01-01 79.0 35.0 54.0 25.0 32.8 0.0 0.0 3.0 ... 52.0 13.0 0.0 52.0 Block Group 2 1480.301 126149.190 10 POLYGON((-73.94435 40.756875, -73.942822 40.75... POLYGON ((-73.94435 40.75688, -73.94282 40.756...

3 rows × 171 columns

[EX11] Plot a map with df_X resulting from the interesetion of the Census Blocks and Long Island Geometry.

In [320]:
Map(Layer(df_X))
Out[320]:

[EX12] Filter the first 1000 census block in df_X and build the same map as before. What's the difference respect to EX11? Why?

In [321]:
print('Now only a 1000 census blocks are painted as we are not taking the whole dataframe.')
Map(Layer(df_X[:1000]))
Now only a 1000 census blocks are painted as we are not taking the whole dataframe.
Out[321]:

4. Data wrangling: normalization and dimensionality reduction

Normalization of extensive variables by the total population

When comparing data for irregular units like census block groups, extra care is needed for extensive variables, i.e. one whose value for a block can be viewed as a sum of sub-block values, as in the case of population. For extensive variables in fact we need first to normalize them, e.g. by dividing by the total area or the total population, depending on the application. Using the metadata available in CARTO Data Observatory we will check which of all the variables can be considered extensive by looking at their default aggregation method: if the aggregation method is classified as SUM, then the variable is normalized by dividing by the total population.

In [322]:
## Get the default aggregation methods
agg_methods_table = read_carto("""
                            SELECT column_name, 
                                    agg_method 
                            FROM variables_public 
                            WHERE dataset_id = 'carto-do-public-data.usa_acs.demographics_sociodemographics_usa_blockgroup_2015_5yrs_20132017'""",
                            credentials = Credentials(username='do-metadata', api_key='default_public'))
agg_methods_table.dropna(inplace = True)
agg_methods_table.head()
Out[322]:
column_name agg_method
0 male_18_to_19 SUM
1 associates_degree SUM
2 income_100000_124999 SUM
3 families_with_young_children SUM
4 two_parents_not_in_labor_force_families_with_y... SUM

[EX13] For census block dataset, df_X, compute densities (i.e. column value divided by the the total population) for each numerical column that appears in agg_methods_table. Store the new value in a column with '_dens' suffix and delete the original one.

In [323]:
sum_var = agg_methods_table[agg_methods_table.agg_method == 'SUM']['column_name']

for item in sum_var:
    if item != 'total_pop':
        df_X[item] = df_X[item]/df_X["total_pop"]
        df_X[item+'_dens']=df_X[item]
        del df_X[item]
df_X
Out[323]:
geoid do_date total_pop median_income income_per_capita renter_occupied_housing_units_paying_cash_median_gross_rent owner_occupied_housing_units_lower_value_quartile owner_occupied_housing_units_median_value owner_occupied_housing_units_upper_value_quartile median_year_structure_built ... rent_35_to_40_percent_dens one_year_more_college_dens female_21_dens female_80_to_84_dens income_45000_49999_dens commute_10_14_mins_dens female_50_to_54_dens female_65_to_66_dens income_20000_24999_dens owner_occupied_housing_units_dens
0 360470340002 2013-01-01 720.0 11341.0 12284.0 247.0 NaN NaN NaN 1994.0 ... 0.000000 0.027778 0.000000 0.163889 0.000000 0.000000 0.000000 0.019444 0.025000 0.009722
1 360471058042 2013-01-01 784.0 12646.0 11530.0 322.0 NaN NaN NaN 1989.0 ... 0.038265 0.053571 0.000000 0.214286 0.017857 0.000000 0.000000 0.000000 0.093112 0.000000
2 360810085002 2013-01-01 79.0 NaN 11085.0 1631.0 NaN NaN NaN 1939.0 ... 0.037975 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
3 360811417005 2013-01-01 142.0 45833.0 20522.0 1319.0 NaN NaN NaN 1969.0 ... 0.105634 0.140845 0.000000 0.063380 0.063380 0.000000 0.070423 0.000000 0.000000 0.133803
4 360470900001 2013-01-01 401.0 41417.0 19454.0 1175.0 354000.0 483300.0 623100.0 1965.0 ... 0.000000 0.052369 0.057357 0.000000 0.000000 0.000000 0.049875 0.000000 0.000000 0.167082
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5892 361031457041 2013-01-01 2075.0 67833.0 27509.0 NaN 164000.0 211600.0 275800.0 1974.0 ... 0.000000 0.048193 0.000000 0.008193 0.006747 0.034217 0.028916 0.000000 0.000000 0.185542
5893 361031467032 2013-01-01 1898.0 84091.0 31188.0 NaN 244200.0 280300.0 332500.0 1957.0 ... 0.000000 0.079031 0.006322 0.010537 0.011064 0.096944 0.036354 0.010011 0.011064 0.296101
5894 361031121032 2013-01-01 1887.0 84107.0 37040.0 NaN 349000.0 397900.0 469100.0 1959.0 ... 0.000000 0.059883 0.007419 0.008479 0.005299 0.104399 0.067833 0.029677 0.013778 0.325914
5895 361031590001 2013-01-01 2347.0 68824.0 27357.0 NaN 267900.0 323400.0 378000.0 1962.0 ... 0.000000 0.106093 0.003409 0.029399 0.004261 0.060929 0.056668 0.025991 0.005113 0.222412
5896 361031352011 2013-01-01 2165.0 133083.0 46157.0 NaN 382700.0 441000.0 490200.0 1956.0 ... 0.000000 0.065589 0.001386 0.014319 0.002771 0.101617 0.045266 0.002309 0.010162 0.313626

5897 rows × 171 columns

In [324]:
df_X.columns.values
Out[324]:
array(['geoid', 'do_date', 'total_pop', 'median_income',
       'income_per_capita',
       'renter_occupied_housing_units_paying_cash_median_gross_rent',
       'owner_occupied_housing_units_lower_value_quartile',
       'owner_occupied_housing_units_median_value',
       'owner_occupied_housing_units_upper_value_quartile',
       'median_year_structure_built', 'median_rent',
       'percent_income_spent_on_rent', 'do_label', 'do_perimeter',
       'do_area', 'do_num_vertices', 'geom', 'geometry',
       'male_18_to_19_dens', 'associates_degree_dens',
       'income_100000_124999_dens', 'families_with_young_children_dens',
       'two_parents_not_in_labor_force_families_with_young_children_dens',
       'commute_45_59_mins_dens', 'male_25_to_29_dens', 'black_pop_dens',
       'million_dollar_housing_units_dens', 'not_in_labor_force_dens',
       'female_25_to_29_dens', 'two_or_more_races_pop_dens',
       'income_35000_39999_dens', 'rent_burden_not_computed_dens',
       'male_70_to_74_dens', 'female_75_to_79_dens',
       'rent_10_to_15_percent_dens', 'pop_in_labor_force_dens',
       'asian_pop_dens', 'amerindian_pop_dens',
       'black_including_hispanic_dens', 'income_30000_34999_dens',
       'households_retirement_income_dens',
       'vacant_housing_units_for_rent_dens', 'female_62_to_64_dens',
       'other_race_pop_dens', 'income_50000_59999_dens',
       'rent_20_to_25_percent_dens', 'less_one_year_college_dens',
       'male_40_to_44_dens', 'hispanic_pop_dens',
       'commute_35_44_mins_dens', 'commute_90_more_mins_dens',
       'rent_40_to_50_percent_dens', 'male_67_to_69_dens',
       'income_15000_19999_dens', 'income_40000_44999_dens',
       'nonfamily_households_dens', 'median_age_dens',
       'vacant_housing_units_dens', 'dwellings_2_units_dens',
       'male_15_to_17_dens', 'male_55_to_59_dens', 'male_60_to_61_dens',
       'female_5_to_9_dens', 'income_less_10000_dens',
       'dwellings_5_to_9_units_dens', 'commute_20_24_mins_dens',
       'female_35_to_39_dens', 'occupied_housing_units_dens',
       'dwellings_1_units_detached_dens', 'commute_15_19_mins_dens',
       'commute_60_more_mins_dens', 'male_80_to_84_dens',
       'bachelors_degree_dens', 'male_35_to_39_dens',
       'female_15_to_17_dens', 'female_70_to_74_dens',
       'income_75000_99999_dens', 'income_125000_149999_dens',
       'dwellings_10_to_19_units_dens', 'family_households_dens',
       'dwellings_3_to_4_units_dens', 'rent_under_10_percent_dens',
       'male_5_to_9_dens', 'male_21_dens', 'female_40_to_44_dens',
       'female_85_and_over_dens', 'income_10000_14999_dens',
       'income_60000_74999_dens', 'income_150000_199999_dens',
       'female_under_5_dens', 'not_hispanic_pop_dens',
       'dwellings_1_units_attached_dens', 'married_households_dens',
       'rent_30_to_35_percent_dens', 'female_45_to_49_dens',
       'female_60_to_61_dens', 'rent_15_to_20_percent_dens',
       'male_20_dens', 'female_55_to_59_dens',
       'two_parents_in_labor_force_families_with_young_children_dens',
       'one_parent_families_with_young_children_dens',
       'commute_60_89_mins_dens',
       'father_one_parent_families_with_young_children_dens',
       'commute_less_10_mins_dens', 'commute_25_29_mins_dens',
       'commute_35_39_mins_dens', 'aggregate_travel_time_to_work_dens',
       'households_dens', 'male_22_to_24_dens', 'female_20_dens',
       'asian_including_hispanic_dens',
       'amerindian_including_hispanic_dens', 'mobile_homes_dens',
       'housing_built_2005_or_later_dens', 'rent_over_50_percent_dens',
       'mortgaged_housing_units_dens', 'unemployed_pop_dens',
       'dwellings_50_or_more_units_dens', 'male_30_to_34_dens',
       'two_parents_father_in_labor_force_families_with_young_children_dens',
       'two_parents_mother_in_labor_force_families_with_young_children_dens',
       'commute_5_9_mins_dens', 'commute_40_44_mins_dens',
       'commuters_by_public_transportation_dens', 'male_45_to_49_dens',
       'female_10_to_14_dens', 'housing_units_dens',
       'vacant_housing_units_for_sale_dens', 'female_pop_dens',
       'female_18_to_19_dens', 'rent_25_to_30_percent_dens',
       'father_in_labor_force_one_parent_families_with_young_children_dens',
       'commute_30_34_mins_dens', 'female_22_to_24_dens',
       'female_30_to_34_dens', 'male_10_to_14_dens', 'white_pop_dens',
       'income_25000_29999_dens', 'male_under_5_dens',
       'male_62_to_64_dens', 'male_65_to_66_dens', 'male_75_to_79_dens',
       'white_including_hispanic_dens', 'income_200000_or_more_dens',
       'high_school_diploma_dens', 'civilian_labor_force_dens',
       'male_50_to_54_dens', 'female_67_to_69_dens',
       'pop_25_years_over_dens', 'housing_units_renter_occupied_dens',
       'dwellings_20_to_49_units_dens', 'commuters_16_over_dens',
       'masters_degree_dens', 'employed_pop_dens',
       'housing_built_2000_to_2004_dens',
       'housing_built_1939_or_earlier_dens', 'male_pop_dens',
       'pop_16_over_dens', 'two_parent_families_with_young_children_dens',
       'armed_forces_dens', 'male_85_and_over_dens',
       'rent_35_to_40_percent_dens', 'one_year_more_college_dens',
       'female_21_dens', 'female_80_to_84_dens',
       'income_45000_49999_dens', 'commute_10_14_mins_dens',
       'female_50_to_54_dens', 'female_65_to_66_dens',
       'income_20000_24999_dens', 'owner_occupied_housing_units_dens'],
      dtype=object)

The columns of your df_X should look like as follows:

[EX14] Delete unused columns with the following characteristics:

  • They are not float
  • They have the suffix 'do-'
  • total_popshould not be

We should keep the variables geo_id and geometry.

In [325]:
for col in df_X.columns:
  if col not in ["geoid", "geometry"]:
    if col not in df_X.select_dtypes(include=['float64']):
      del df_X[col]
    if "do-" in col:
      del df_X[col]
            
df_X.drop(['total_pop'], axis = 1, inplace=True)

df_X.head(3)
Out[325]:
geoid median_income income_per_capita renter_occupied_housing_units_paying_cash_median_gross_rent owner_occupied_housing_units_lower_value_quartile owner_occupied_housing_units_median_value owner_occupied_housing_units_upper_value_quartile median_year_structure_built median_rent percent_income_spent_on_rent ... rent_35_to_40_percent_dens one_year_more_college_dens female_21_dens female_80_to_84_dens income_45000_49999_dens commute_10_14_mins_dens female_50_to_54_dens female_65_to_66_dens income_20000_24999_dens owner_occupied_housing_units_dens
0 360470340002 11341.0 12284.0 247.0 NaN NaN NaN 1994.0 181.0 29.2 ... 0.000000 0.027778 0.0 0.163889 0.000000 0.0 0.0 0.019444 0.025000 0.009722
1 360471058042 12646.0 11530.0 322.0 NaN NaN NaN 1989.0 301.0 29.3 ... 0.038265 0.053571 0.0 0.214286 0.017857 0.0 0.0 0.000000 0.093112 0.000000
2 360810085002 NaN 11085.0 1631.0 NaN NaN NaN 1939.0 1620.0 50.0 ... 0.037975 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.000000

3 rows × 166 columns

Let's store in df_X_cols variable a list of the columns in df_X except of geoidand geometry.

In [326]:
df_X_cols = list(df_X.drop(['geoid','geometry'], axis = 1).columns)

Your dataframe should look like as:

In [327]:
df_X.head()
Out[327]:
geoid median_income income_per_capita renter_occupied_housing_units_paying_cash_median_gross_rent owner_occupied_housing_units_lower_value_quartile owner_occupied_housing_units_median_value owner_occupied_housing_units_upper_value_quartile median_year_structure_built median_rent percent_income_spent_on_rent ... rent_35_to_40_percent_dens one_year_more_college_dens female_21_dens female_80_to_84_dens income_45000_49999_dens commute_10_14_mins_dens female_50_to_54_dens female_65_to_66_dens income_20000_24999_dens owner_occupied_housing_units_dens
0 360470340002 11341.0 12284.0 247.0 NaN NaN NaN 1994.0 181.0 29.2 ... 0.000000 0.027778 0.000000 0.163889 0.000000 0.0 0.000000 0.019444 0.025000 0.009722
1 360471058042 12646.0 11530.0 322.0 NaN NaN NaN 1989.0 301.0 29.3 ... 0.038265 0.053571 0.000000 0.214286 0.017857 0.0 0.000000 0.000000 0.093112 0.000000
2 360810085002 NaN 11085.0 1631.0 NaN NaN NaN 1939.0 1620.0 50.0 ... 0.037975 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 0.000000
3 360811417005 45833.0 20522.0 1319.0 NaN NaN NaN 1969.0 1319.0 37.2 ... 0.105634 0.140845 0.000000 0.063380 0.063380 0.0 0.070423 0.000000 0.000000 0.133803
4 360470900001 41417.0 19454.0 1175.0 354000.0 483300.0 623100.0 1965.0 1175.0 50.0 ... 0.000000 0.052369 0.057357 0.000000 0.000000 0.0 0.049875 0.000000 0.000000 0.167082

5 rows × 166 columns

Any missing data?

Missing data are common in surveys, as the ACS. Before using the ACS data as covariates we therefore need to check if there are any missing data and to decide how to impute the missing values.

[EX15] Identify which variables have missing data and how many samples are missing.

In [328]:
df_X.isnull().sum()
Out[328]:
geoid                                                             0
median_income                                                   272
income_per_capita                                                41
renter_occupied_housing_units_paying_cash_median_gross_rent    1480
owner_occupied_housing_units_lower_value_quartile               873
                                                               ... 
commute_10_14_mins_dens                                          34
female_50_to_54_dens                                             34
female_65_to_66_dens                                             34
income_20000_24999_dens                                          34
owner_occupied_housing_units_dens                                34
Length: 166, dtype: int64

[EX16] If we drop all samples with at least a missing value, which will be the size of this new dataframe? Justify if it is enough to continue with this new dataframe or not

In [329]:
print(df_X.shape)
df_X_reduced = df_X.dropna()
print(df_X_reduced.shape)
print('The size of the new dataframe without the rows with Na values will only have 407 records out from 5897 it had previously. We are losing more than 90% of the data, there is no way we can continue, the results won\'t be significant and can be biased.')
(5897, 166)
(407, 166)
The size of the new dataframe without the rows with Na values will only have 407 records out from 5897 it had previously. We are losing more than 90% of the data, there is no way we can continue, the results won't be significant and can be biased.

Are the variables correlated?

Next, we will check the pair-wise correlations between the ACS variables.

[EX17] Calculate and plot the correlation matrix for 50 variables randomly selected. What do you observe?

In [330]:
def corr_heatmap(data):
    data_corr = data.corr().round(2)
    f, ax = plt.subplots(figsize=(15,15))
    sns.heatmap(data_corr,  vmin=-1, vmax=1, linewidths=.5, ax=ax)
    plt.show()

corr_heatmap(df_X.sample(n=50,axis='columns'))
print('We observe that, in general, the correlation among variables is small. However we can see some exceptions with positive (close to 1.0) and negative (close to -1.0) correlated variables.')
We observe that, in general, the correlation among variables is small. However we can see some exceptions with positive (close to 1.0) and negative (close to -1.0) correlated variables.

To account for missing values and reduce the model dimensionality, we will transform the data using Priciple Component Analysis

To impute the missing data and reduce the model dimensionality we will use a probabilistic formulation of Principal Component Analysis (PCA). PCA is a technique to transform data sets of high dimensional vectors into lower dimensional ones that finds a smaller-dimensional linear representation of data vectors such that the original data could be reconstructed from the compressed representation with the minimum square error.

In its probabilistic formulation, probabilistic PCA (PPCA) the complete data are modelled by a generative latent variable model which iteratively updates the expected complete data log-likelihood and the maximum likelihood estimates of the parameters. In PPCA, the data are incomplete because the principal components are not observed and are treated as latent variables, which also allows to extend the method to when missing data are present.

More information can be found at http://www.jmlr.org/papers/volume11/ilin10a/ilin10a.pdf.

PPCA in a nutshell

PCA can also be described as the maximum likelihood solution of a probabilistic latent variable model, which is known as PPCA:

\begin{equation*} Y_{ij} = \mathbf{P}_i \, \mathbf{Z}_j + m_i + \varepsilon_{ij} \quad i = 1, .., d; \, j = 1, .., n \end{equation*}

with

\begin{align} p(\mathbf{Z}_j) \sim N(0, \mathbb{1}) \\ p(\varepsilon_{ij}) \sim N(0, \nu) \end{align}

Both the principal components $Z$ and the noise $\varepsilon$ are assumed normally distributed. The model can be identified by finding the maximum likelihood (ML) estimate for the model parameters using the Expectation-Maximization (EM) algorithm by minimizing the mean-square error of the observed part of the data. EM is a general framework for learning parameters with incomplete data which iteratively updates the expected complete data log-likelihood and the maximum likelihood estimates of the parameters. In PPCA, the data are incomplete because the principal components, $Z_i$, are not observed and are treated as latent variables. When missing data are present, in the E-step, the expectation of the complete-data log-likelihood is taken with respect to the conditional distribution of the latent variables given the observed variables. In this case, the update EM rules are the following.

  1. E-step: \begin{align} \mathbf{\Sigma}_{\mathbf{Z}_j} = \nu \left(\nu \, \mathbb{1} + \sum_i \mathbf{P}_i \mathbf{P}_i^T \right)^{-1} \\ \overline{\mathbf{Z}}_j = \dfrac{1}{\nu}\mathbf{\Sigma}_{\mathbf{Z}_j} \sum_i \mathbf{P}_i \left(Y_{ij}- m_i \right) \\ m_{i} = \dfrac{1}{n} \sum_j \left(Y_{ij} - \mathbf{P}_i^T \, \overline{\mathbf{Z}}_j \right) \\ \end{align}

  2. M-step: \begin{align} \mathbf{P}_{i} = \left( \sum_j \overline{\mathbf{Z}}_j \overline{\mathbf{Z}}_j ^T + \mathbf{\Sigma}_{\mathbf{Z}_j} \right)^{-1} \sum_j \overline{\mathbf{Z}}_j \, \left(Y_{ij}- m_{ij} \right)\\ \nu = \dfrac{1}{n} \sum_{ij} \left[ \left(Y_{ij} - \mathbf{P}_i^T \, \overline{\mathbf{Z}}_j - m_i \right)^2 + \mathbf{P}_i^T \, \overline{\mathbf{Z}}_j \mathbf{P}_i \, \right] \end{align}

where each row of $\mathbf{P}$ and $\overline{\mathbf{Z}}$ is recomputed based only on those columns of $\overline{\mathbf{Z}}$ which contribute to the reconstruction of the observed values in the corresponding row of the data matrix.

To impute missing values and extract PC scores (linearly decorrelated) we will use PPCA for our the df_X_colsof our dataframe as follows:

In [331]:
X, var_exp = run_ppca(df_X, df_X_cols, min_obs = 0.8)

At the tail columns we can observe the Principal Component variables:

In [332]:
X.head()
Out[332]:
geoid median_income income_per_capita renter_occupied_housing_units_paying_cash_median_gross_rent owner_occupied_housing_units_lower_value_quartile owner_occupied_housing_units_median_value owner_occupied_housing_units_upper_value_quartile median_year_structure_built median_rent percent_income_spent_on_rent ... pc_127 pc_128 pc_129 pc_130 pc_131 pc_132 pc_133 pc_134 pc_135 pc_136
0 360470340002 11341.0 12284.0 247.0 NaN NaN NaN 1994.0 181.0 29.2 ... -0.007402 -0.267329 -0.553755 -0.499219 0.140436 -0.707984 -0.095309 -0.022253 -0.069363 -0.020467
1 360471058042 12646.0 11530.0 322.0 NaN NaN NaN 1989.0 301.0 29.3 ... -0.172060 -0.416468 -0.630714 0.692891 0.031531 1.086017 -0.026878 -0.051493 -0.545295 0.125613
2 360810085002 NaN 11085.0 1631.0 NaN NaN NaN 1939.0 1620.0 50.0 ... -0.241232 -1.126270 0.253112 0.379541 0.208730 -0.282311 -0.645952 0.075613 -0.065487 -0.029871
3 360811417005 45833.0 20522.0 1319.0 NaN NaN NaN 1969.0 1319.0 37.2 ... 0.105176 0.466583 0.179606 0.055362 0.239443 0.219797 -0.035361 0.234292 0.222040 -0.029155
4 360470900001 41417.0 19454.0 1175.0 354000.0 483300.0 623100.0 1965.0 1175.0 50.0 ... 0.173744 0.298112 -0.159614 0.298278 0.265272 0.203781 0.603363 -0.329616 -0.105994 -0.022781

5 rows × 303 columns

[EX18] Plot the explained variance resulting from Principal Component Analysis. Tip: You can use the function plot_pc_var(var_exp, textsize, title). Which is the Principal Component that explains the most of the variance?

In [333]:
plot_pc_var(var_exp, textsize=15, title='PCA Explained Variance')
print('The Principal Component that explains most of the variance is the first one. The PC are sorted in decreasing order on explaining the variance.')
The Principal Component that explains most of the variance is the first one. The PC are sorted in decreasing order on explaining the variance.

We decide to keep the PCs up that explain up to 80% of the variance (the black dashed line in the plot below), but this choice is somewhat arbitrary and might vary depending on the application.

[EX19] To understand the relashionship between the transformed variables (the PC scores) and the original variable, we can plot the 4 variables most highly correlated with each PC. Use the function plot_pc_corr (X, df_X_cols)to plot these relationships. Which is the most positively correlated variable with PC0? Ant the most negatively correlated variable with PC0?

Note: depending on the random seed on your computer, the signs of the correlations might change.

In [335]:
plot_pc_corr(X, df_X_cols)
print('The most positively correlated variable with PC0 is HOUSING UNITS RENTER OCCUPIED DENS with more than 0.75. On the other hand, the most negatively correlated variable is OWNER OCCUPIED HOUSING UNITS DENS with a similar absolute value.')
The most positively correlated variable with PC0 is HOUSING UNITS RENTER OCCUPIED DENS with more than 0.75. On the other hand, the most negatively correlated variable is OWNER OCCUPIED HOUSING UNITS DENS with a similar absolute value.

[EX20] Create 3 maps as follows:

  • One with the geo-visualization of the first Principal Component
  • One with the geo-visualization of owner_occupied_housing_units_dens variable
  • One with the geo-visualization of housing_units_renter_occupied_densvariable
In [336]:
# geo-visualization of the first Principal Component
Map([Layer(X,style=color_bins_style('pc_0'), encode_data=False)])
Out[336]:
In [337]:
# geo-visualization of `owner_occupied_housing_units_dens` variable
Map([Layer(X,style=color_bins_style('owner_occupied_housing_units_dens'))])
Out[337]:
In [338]:
# geo-visualization of `housing_units_renter_occupied_dens`variable
Map([Layer(X,style=color_bins_style('housing_units_renter_occupied_dens'))])
Out[338]:

[EX21] Select only up the PC up to explain 80 % of the variance (dimensionality reduction)

In [343]:
index_list = ['geoid', 'geometry', 'pc_0']

cum_var_exp = np.cumsum(var_exp)
for index in range (0, len(cum_var_exp)):
    if cum_var_exp[index] < 0.8:
        index_list.append('pc_' + str(index+1))


#drop_cols = np.setdiff1d(X.columns, index_list)

for col in X.columns:
    if col not in index_list:
        X.drop([col], axis = 1, inplace=True)

X
Out[343]:
geoid geometry pc_0 pc_1 pc_2 pc_3 pc_4 pc_5 pc_6 pc_7 ... pc_59 pc_60 pc_61 pc_62 pc_63 pc_64 pc_65 pc_66 pc_67 pc_68
0 360470340002 POLYGON ((-73.99473 40.57152, -73.99358 40.571... 0.559404 -6.908538 14.735062 -7.901487 -2.427835 5.595734 -5.171932 4.498502 ... -1.950631 0.348776 0.801331 -0.574349 -0.505093 0.774369 -2.283671 0.559578 0.007618 -0.146709
1 360471058042 POLYGON ((-73.88959 40.65134, -73.88999 40.651... 6.993617 0.129231 24.105093 -13.807347 -1.046774 1.164595 -3.533382 5.192408 ... 2.346550 0.047911 -5.535107 -3.279021 1.067112 -2.784222 -1.688500 0.926626 -1.392676 2.577099
2 360810085002 POLYGON ((-73.94435 40.75688, -73.94282 40.756... 7.293373 2.316299 -1.587367 -0.281509 3.490770 4.730380 -2.753776 -1.498218 ... 1.648538 1.949764 0.743265 1.141203 -2.182980 0.015906 1.103304 0.038698 -2.265194 1.313859
3 360811417005 POLYGON ((-73.79723 40.74170, -73.79727 40.740... 2.997390 -2.093021 4.274223 -3.839997 0.224379 -1.414656 -3.645954 -3.406002 ... -0.716797 1.544626 0.240153 0.511747 -1.402086 1.527858 -1.742347 -1.845593 3.177760 4.641683
4 360470900001 POLYGON ((-73.91826 40.66857, -73.91722 40.664... 4.259888 -4.800530 2.958818 0.158746 -2.756890 -2.001692 6.411877 0.823422 ... -3.508735 1.336166 0.143500 0.540500 -0.348818 -2.216179 -1.750566 1.425820 0.711947 -2.491029
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5892 361031457041 POLYGON ((-73.19007 40.80020, -73.19013 40.799... -1.073205 -4.495927 -1.942821 2.003686 2.956078 -1.020185 -0.566204 0.301432 ... -0.322835 -0.636193 0.113329 -1.012608 0.424731 -0.098133 -1.031075 0.662514 0.221623 -0.759275
5893 361031467032 POLYGON ((-73.30521 40.74090, -73.30455 40.740... -2.929162 -1.996071 -1.558574 1.706757 2.235890 -2.231739 -0.144337 -0.527786 ... -1.043892 0.449971 0.709522 0.277077 0.061942 0.479844 0.140910 0.445573 -0.004618 0.508719
5894 361031121032 POLYGON ((-73.31342 40.84017, -73.31293 40.838... -5.252166 -0.438449 0.841320 0.874311 1.911560 -1.708018 -0.987079 0.562970 ... -0.396561 0.830014 0.103353 -0.925654 0.781510 0.040904 0.259554 -0.074348 0.386426 0.052170
5895 361031590001 POLYGON ((-73.01337 40.76174, -73.01250 40.759... -0.547977 -2.389447 -0.064390 0.964480 3.012825 -0.400226 -0.640963 0.667340 ... -0.033455 0.182279 0.215250 0.630874 0.106424 0.632414 -0.249004 -0.638303 -1.743886 0.568900
5896 361031352011 POLYGON ((-73.28156 40.84333, -73.28156 40.842... -6.229585 0.095186 0.038300 0.489465 1.380086 -0.814124 -0.032986 1.388186 ... -0.764257 0.296854 0.319785 -0.126409 0.037046 -0.355279 -0.076809 -0.348780 -0.080620 0.175545

5897 rows × 71 columns

Your new X dataframe should look like as follows:

In [185]:
X
Out[185]:
geoid geometry pc_0 pc_1 pc_2 pc_3 pc_4 pc_5 pc_6 pc_7 ... pc_58 pc_59 pc_60 pc_61 pc_62 pc_63 pc_64 pc_65 pc_66 pc_67
0 360470340002 POLYGON ((-73.99473 40.57152, -73.99358 40.571... 5.192626 -2.843960 -5.110736 10.490491 -8.011965 0.400330 -5.869855 -1.035688 ... 0.427732 1.133546 1.126083 2.826179 0.120132 1.486582 1.188179 0.987515 0.894004 1.349302
1 360471058042 POLYGON ((-73.88959 40.65134, -73.88999 40.651... 1.034807 -7.023206 -0.529719 19.242748 -11.120903 1.462790 -3.346949 0.925443 ... -1.140199 -1.632804 -0.007739 0.527558 -0.537280 0.422478 -1.854992 3.283291 3.549802 -0.887771
2 360810085002 POLYGON ((-73.94435 40.75688, -73.94282 40.756... 6.933532 -6.891814 0.043945 -2.201138 0.609429 2.362161 -2.065274 -2.664163 ... -0.155307 -1.399615 -0.614724 -2.836122 0.304700 1.312867 -0.182964 -2.502034 -2.559217 0.346803
3 360811417005 POLYGON ((-73.79723 40.74170, -73.79727 40.740... 6.825613 -4.070425 -1.726064 0.270334 -3.478294 0.982270 1.479187 -2.858226 ... 1.605078 -0.109263 -0.722469 -0.637094 0.514089 0.275256 1.561046 0.047552 0.283640 3.289071
4 360470900001 POLYGON ((-73.91826 40.66857, -73.91722 40.664... 5.178326 -6.088028 -3.511389 0.878919 -0.080299 -1.165851 2.400042 4.064214 ... -1.294178 0.240982 0.261707 1.921196 -2.199670 -0.875671 0.946583 1.725855 2.661502 -1.204959
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5892 361031457041 POLYGON ((-73.19007 40.80020, -73.19013 40.799... -2.838701 3.209901 -4.305829 -2.001303 1.036101 2.839832 0.582074 0.219115 ... 0.118098 -1.930918 -0.871813 -1.000672 -0.952126 -0.433922 1.079077 0.969601 1.947192 -0.207097
5893 361031467032 POLYGON ((-73.30521 40.74090, -73.30455 40.740... -2.138381 4.666200 -2.075725 -1.475258 1.214378 1.896171 2.170656 -0.474414 ... -0.335377 0.696566 0.205550 1.323278 -0.615716 0.667934 0.521443 -0.454292 -0.577949 0.834495
5894 361031121032 POLYGON ((-73.31342 40.84017, -73.31293 40.838... -2.273991 7.498023 -1.045971 1.481284 0.117481 1.822349 0.829146 -0.543341 ... 0.615100 -0.323500 -0.252571 0.325533 0.388333 -0.129070 0.325454 -0.817731 0.389817 -0.117358
5895 361031590001 POLYGON ((-73.01337 40.76174, -73.01250 40.759... -6.614139 4.265611 -2.658424 -0.042851 0.467846 2.774706 -0.638517 0.023603 ... -0.599605 -0.109841 -0.318167 -0.021246 0.854441 1.813779 1.209829 -0.587224 -2.009218 -0.334778
5896 361031352011 POLYGON ((-73.28156 40.84333, -73.28156 40.842... -2.531293 9.390975 0.371246 0.621706 0.218189 1.684551 -0.234682 0.360492 ... 0.355128 0.040058 -0.015801 0.128938 -0.176417 0.615251 0.184358 0.153766 -0.790937 0.023862

5897 rows × 70 columns

Finally we have to merge the Census Block (i.e. X dataframe) with the initial dataset with the information about Starbucks shops (i.e. storesdataframe). To do it, we will do a spatial join based on geometryattributes.

In [344]:
stores_enriched = gpd.sjoin(stores[['revenue','geometry']], X, how='right', op='intersects').drop('index_left', axis = 1)
stores_enriched
Out[344]:
revenue geoid geometry pc_0 pc_1 pc_2 pc_3 pc_4 pc_5 pc_6 ... pc_59 pc_60 pc_61 pc_62 pc_63 pc_64 pc_65 pc_66 pc_67 pc_68
2672 1.016898e+06 360810711002 POLYGON ((-73.85603 40.72346, -73.85670 40.722... 0.252667 -1.592204 -2.028158 -2.627619 -1.268654 -2.726792 -0.706863 ... 0.555480 0.874447 0.443333 0.740931 -0.602890 0.191398 -0.775267 0.272021 0.522995 0.198035
2428 4.470646e+05 360810737002 POLYGON ((-73.84536 40.71925, -73.84668 40.717... 0.439377 6.139914 1.066426 -3.930415 -1.268071 -3.491042 0.139864 ... 0.063035 0.446690 1.646802 0.382731 -0.911506 0.995553 -1.578721 0.276976 0.542847 -0.377862
3153 7.028071e+05 360810645002 POLYGON ((-73.85714 40.71099, -73.85668 40.707... 1.074362 1.669466 -0.688634 -0.664445 -1.506192 -1.187679 0.058608 ... 0.461403 -0.911561 0.392867 0.169059 -0.665757 -0.062881 -0.449070 -0.362457 -0.704930 0.388533
3737 2.831908e+05 360810713053 POLYGON ((-73.85288 40.72638, -73.85351 40.725... 3.832132 7.784235 -1.892865 -0.196634 0.560720 -2.850941 -0.615845 ... 0.744937 -1.311534 1.884556 0.739733 -0.827812 -0.552496 0.303563 -0.602510 -0.577819 -0.983612
463 3.672994e+05 360810216001 POLYGON ((-73.82640 40.71565, -73.82698 40.715... 2.022870 4.268485 3.309649 -0.754714 -0.439542 -3.008471 -0.575690 ... -0.382852 -1.042494 0.752506 0.433582 -1.225601 -0.535040 -0.630196 -0.389563 -0.117868 -0.470984
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5892 NaN 361031457041 POLYGON ((-73.19007 40.80020, -73.19013 40.799... -1.073205 -4.495927 -1.942821 2.003686 2.956078 -1.020185 -0.566204 ... -0.322835 -0.636193 0.113329 -1.012608 0.424731 -0.098133 -1.031075 0.662514 0.221623 -0.759275
5893 NaN 361031467032 POLYGON ((-73.30521 40.74090, -73.30455 40.740... -2.929162 -1.996071 -1.558574 1.706757 2.235890 -2.231739 -0.144337 ... -1.043892 0.449971 0.709522 0.277077 0.061942 0.479844 0.140910 0.445573 -0.004618 0.508719
5894 NaN 361031121032 POLYGON ((-73.31342 40.84017, -73.31293 40.838... -5.252166 -0.438449 0.841320 0.874311 1.911560 -1.708018 -0.987079 ... -0.396561 0.830014 0.103353 -0.925654 0.781510 0.040904 0.259554 -0.074348 0.386426 0.052170
5895 NaN 361031590001 POLYGON ((-73.01337 40.76174, -73.01250 40.759... -0.547977 -2.389447 -0.064390 0.964480 3.012825 -0.400226 -0.640963 ... -0.033455 0.182279 0.215250 0.630874 0.106424 0.632414 -0.249004 -0.638303 -1.743886 0.568900
5896 NaN 361031352011 POLYGON ((-73.28156 40.84333, -73.28156 40.842... -6.229585 0.095186 0.038300 0.489465 1.380086 -0.814124 -0.032986 ... -0.764257 0.296854 0.319785 -0.126409 0.037046 -0.355279 -0.076809 -0.348780 -0.080620 0.175545

5898 rows × 72 columns

[EX 22] How many samples are with any value in revenue. Why are only 177 census blocks with revenue value?

In [188]:
print(f'There are {stores_enriched.revenue.count()} with any value in \'revenue\'.') 
print('There are only 177 census blocks with revenue value because those are the ones with starbucks stores on them and thus we can know the revenue value of that census blocks but no the others. ')
There are 177 with any value in 'revenue'.
There are only 177 census blocks with revenue value because those are the ones with starbucks stores on them and thus we can know the revenue value of that census blocks but no the others. 

5. Modelling: Train the model using the data for the existing stores

In this section of the project we are almost ready to train a model. As a regression problem, we will use a Linear Regression model as a baseline algorithm.

[EX 23] Build a dataframe df1 with only samples with census blocks with any value in revenuefeature.

In [345]:
df1 = stores_enriched.dropna()
df1
Out[345]:
revenue geoid geometry pc_0 pc_1 pc_2 pc_3 pc_4 pc_5 pc_6 ... pc_59 pc_60 pc_61 pc_62 pc_63 pc_64 pc_65 pc_66 pc_67 pc_68
2672 1.016898e+06 360810711002 POLYGON ((-73.85603 40.72346, -73.85670 40.722... 0.252667 -1.592204 -2.028158 -2.627619 -1.268654 -2.726792 -0.706863 ... 0.555480 0.874447 0.443333 0.740931 -0.602890 0.191398 -0.775267 0.272021 0.522995 0.198035
2428 4.470646e+05 360810737002 POLYGON ((-73.84536 40.71925, -73.84668 40.717... 0.439377 6.139914 1.066426 -3.930415 -1.268071 -3.491042 0.139864 ... 0.063035 0.446690 1.646802 0.382731 -0.911506 0.995553 -1.578721 0.276976 0.542847 -0.377862
3153 7.028071e+05 360810645002 POLYGON ((-73.85714 40.71099, -73.85668 40.707... 1.074362 1.669466 -0.688634 -0.664445 -1.506192 -1.187679 0.058608 ... 0.461403 -0.911561 0.392867 0.169059 -0.665757 -0.062881 -0.449070 -0.362457 -0.704930 0.388533
3737 2.831908e+05 360810713053 POLYGON ((-73.85288 40.72638, -73.85351 40.725... 3.832132 7.784235 -1.892865 -0.196634 0.560720 -2.850941 -0.615845 ... 0.744937 -1.311534 1.884556 0.739733 -0.827812 -0.552496 0.303563 -0.602510 -0.577819 -0.983612
463 3.672994e+05 360810216001 POLYGON ((-73.82640 40.71565, -73.82698 40.715... 2.022870 4.268485 3.309649 -0.754714 -0.439542 -3.008471 -0.575690 ... -0.382852 -1.042494 0.752506 0.433582 -1.225601 -0.535040 -0.630196 -0.389563 -0.117868 -0.470984
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1959 7.583299e+05 360470043004 POLYGON ((-73.99327 40.68782, -73.99047 40.686... 1.727690 10.299295 -7.012968 -3.767682 1.520145 -0.247609 1.972996 ... 1.196609 0.033695 -0.102994 0.080921 -0.914189 0.313975 0.145072 0.504177 0.359990 -0.609600
1679 4.740776e+05 360470015001 POLYGON ((-73.98498 40.69721, -73.98489 40.696... 2.180453 13.042804 -4.562309 -1.915944 4.049099 1.762115 0.965293 ... -0.694440 -1.672980 -0.420981 1.746762 -0.180278 0.178025 0.093357 -0.019827 -0.271463 -0.704316
2657 4.233542e+05 360470011001 POLYGON ((-73.99045 40.69373, -73.99097 40.692... 2.747827 14.935410 -5.899751 -4.309660 5.043577 -0.139699 1.402684 ... -1.551677 -1.892928 -0.967848 1.621482 -1.388005 -0.243532 -1.171461 0.577366 0.261563 -0.152048
1564 8.716444e+05 360470009001 POLYGON ((-73.99374 40.69154, -73.99174 40.690... -0.819357 9.447584 -4.030089 -4.286903 0.625184 -0.124047 -0.148745 ... 1.364135 0.192317 0.706559 0.187288 1.332515 -0.921023 0.255505 -0.445249 -0.796541 0.266126
3195 5.300459e+05 360470005022 POLYGON ((-73.99508 40.69315, -73.99310 40.692... 1.615688 11.520969 -2.714698 -5.553594 1.634044 -1.578798 1.020614 ... 2.474426 0.747393 1.265640 0.711468 1.083995 -1.035516 0.937744 1.335903 0.076753 0.186234

177 rows × 72 columns

Your dataframe should look like as follows:

In [50]:
df1
Out[50]:
revenue geoid geometry pc_0 pc_1 pc_2 pc_3 pc_4 pc_5 pc_6 ... pc_57 pc_58 pc_59 pc_60 pc_61 pc_62 pc_63 pc_64 pc_65 pc_66
2672 1.016898e+06 360810711002 POLYGON ((-73.85603 40.72346, -73.85670 40.722... 12.479078 4.487023 -1.051260 1.518435 -4.454319 4.266415 -4.703336 ... 0.517258 1.591402 1.366639 -0.568144 -0.279715 1.301647 0.102796 1.415551 -0.008628 -0.070181
2428 4.470646e+05 360810737002 POLYGON ((-73.84536 40.71925, -73.84668 40.717... -0.249307 -0.569163 -5.257708 -1.786308 -1.872905 0.871998 -3.142173 ... 0.734746 -1.184124 -0.064357 0.317999 -1.114184 0.730715 -0.351286 1.382345 1.272939 -1.318328
3153 7.028071e+05 360810645002 POLYGON ((-73.85714 40.71099, -73.85668 40.707... -1.049198 -1.336037 -1.712426 0.330946 -0.125553 1.165895 -1.574404 ... 0.082129 0.087422 -0.575829 0.423231 -1.389301 0.429835 0.255773 -0.326719 0.697345 0.677415
3737 2.831908e+05 360810713053 POLYGON ((-73.85288 40.72638, -73.85351 40.725... 3.096626 -2.559718 -6.686087 0.488271 1.767300 -0.746852 -2.777389 ... 2.291413 0.734310 -0.788513 0.671868 -1.983773 -0.327098 0.716763 -1.419552 0.268074 -0.536835
463 3.672994e+05 360810216001 POLYGON ((-73.82640 40.71565, -73.82698 40.715... 7.542061 0.630688 -3.902694 -4.933818 0.195453 1.219952 -2.633818 ... 0.746752 -0.357757 0.526710 -0.023541 -0.996121 -0.656266 -1.562447 -0.324467 1.823167 -0.200752
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1959 7.583299e+05 360470043004 POLYGON ((-73.99327 40.68782, -73.99047 40.686... 4.297083 0.111866 -11.458097 5.098344 -1.510299 -2.130083 -0.018044 ... 0.403197 -1.497805 -0.504742 0.314848 0.397894 1.085434 0.385326 -0.238540 0.007202 0.609491
1679 4.740776e+05 360470015001 POLYGON ((-73.98498 40.69721, -73.98489 40.696... 11.718780 1.982451 -16.378055 2.503568 0.466371 -4.231349 2.028808 ... 0.204177 -0.206085 -0.221415 0.601711 -1.107285 -0.814319 -1.052397 0.024652 -0.399906 0.362401
2657 4.233542e+05 360470011001 POLYGON ((-73.99045 40.69373, -73.99097 40.692... 4.117057 -0.549696 -14.311532 3.572769 -1.180380 -3.548954 -0.439085 ... -0.605875 -0.404134 -0.564784 0.033924 -0.700148 -0.734461 -0.579420 -0.795061 0.691078 0.301668
1564 8.716444e+05 360470009001 POLYGON ((-73.99374 40.69154, -73.99174 40.690... 10.478722 5.542242 -13.166676 1.335862 -2.375054 -0.561335 0.894898 ... 1.387108 0.819284 1.068894 -0.323931 1.612875 0.192176 -0.691359 -0.886945 -0.293483 -0.147496
3195 5.300459e+05 360470005022 POLYGON ((-73.99508 40.69315, -73.99310 40.692... 3.644949 -0.251786 -11.682077 0.307237 -2.650254 -2.321638 -1.077324 ... 1.403664 -0.758912 -0.591019 0.948665 0.537316 1.282358 1.835632 0.029874 0.549557 0.499413

177 rows × 70 columns

[EX 24] Build the X and y matrices with the following characteristics:

  • y with revenue scale divided by a factor of 10^5
  • X with only pc_i features
In [346]:
y = df1.revenue/10**5
filter_col = [col for col in df1 if col.startswith('pc_')]
X = df1[filter_col]

[EX 25] Train a Linear Regressionmodel and train it with X and y. Make a prediction and store it in df1in the new y_hatcolumn.

In [347]:
from sklearn.linear_model import LinearRegression

lin_regr  = LinearRegression()
lin_regr.fit(X, y)
df1['y_hat'] = lin_regr.predict(X)
In [348]:
pandas_scatter(df1,'revenue','y_hat', xlab = 'Observed', ylab = 'Predicted', pseudo_R2 = True)

If we exectute the function pandas_scatterwe can have a first measure of the performance. Your scatter plot should be similar to the following one:

In [349]:
pandas_scatter(df1,'revenue','y_hat', xlab = 'Observed', ylab = 'Predicted', pseudo_R2 = True)

[EX 25] To measure the performance in the right way, we should split the X and y to generate a X_train, y_train, X_test and y_test (80% for training and 20% for test). Which is the parameter to evaluate a regression model?

In [379]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)

print('The parameter to evaluate a regression model is the R2 value, that shows how much of variability in dependent variable can be explained by the model.')
The parameter to evaluate a regression model is the R2 value, that shows how much of variability in dependent variable can be explained by the model.

[EX 26] Train a linear regression model and evaluate the score. Use the function r2_score(y_actual, y_prediction)to calculate the R^2 for y_train and y_test.

In [380]:
from sklearn.linear_model import LinearRegression

Lin_Regr  = LinearRegression()

Lin_Regr.fit(X_train, y_train)
y_predicted = Lin_Regr.predict(X_test)


r2_score(y_test, y_predicted)
Out[380]:
0.8374154092540762

6. Get the predictions by Census block group, visualize results and generate insights

[EX 27] Finally, we can use this model to predict the annual revenue ($) for each block group in Long Island. Make a copy of stores_enricheddataframe and store it in df2variable and execute the prediction based on the previous trained linear regression model. Store the prediction in a new revenue_predcolumn in df2 dataframe.

In [381]:
df2 = stores_enriched.copy()

df2['revenue_pred'] = Lin_Regr.predict(df2[filter_col])
df2.head()
Out[381]:
revenue geoid geometry pc_0 pc_1 pc_2 pc_3 pc_4 pc_5 pc_6 ... pc_60 pc_61 pc_62 pc_63 pc_64 pc_65 pc_66 pc_67 pc_68 revenue_pred
2672 1.016898e+06 360810711002 POLYGON ((-73.85603 40.72346, -73.85670 40.722... 0.252667 -1.592204 -2.028158 -2.627619 -1.268654 -2.726792 -0.706863 ... 0.874447 0.443333 0.740931 -0.602890 0.191398 -0.775267 0.272021 0.522995 0.198035 10.660760
2428 4.470646e+05 360810737002 POLYGON ((-73.84536 40.71925, -73.84668 40.717... 0.439377 6.139914 1.066426 -3.930415 -1.268071 -3.491042 0.139864 ... 0.446690 1.646802 0.382731 -0.911506 0.995553 -1.578721 0.276976 0.542847 -0.377862 2.906586
3153 7.028071e+05 360810645002 POLYGON ((-73.85714 40.71099, -73.85668 40.707... 1.074362 1.669466 -0.688634 -0.664445 -1.506192 -1.187679 0.058608 ... -0.911561 0.392867 0.169059 -0.665757 -0.062881 -0.449070 -0.362457 -0.704930 0.388533 7.530527
3737 2.831908e+05 360810713053 POLYGON ((-73.85288 40.72638, -73.85351 40.725... 3.832132 7.784235 -1.892865 -0.196634 0.560720 -2.850941 -0.615845 ... -1.311534 1.884556 0.739733 -0.827812 -0.552496 0.303563 -0.602510 -0.577819 -0.983612 2.408051
463 3.672994e+05 360810216001 POLYGON ((-73.82640 40.71565, -73.82698 40.715... 2.022870 4.268485 3.309649 -0.754714 -0.439542 -3.008471 -0.575690 ... -1.042494 0.752506 0.433582 -1.225601 -0.535040 -0.630196 -0.389563 -0.117868 -0.470984 4.727256

5 rows × 73 columns

[EX 28] Create a map visualization with the following characteristics:

  • Layer with df2and the revenue_predvariable. Include also a widget to show a histogram for revenue_pred
  • Layer with storesand the revenuevariable (similar to EX7)
  • show_info: True
  • viewport: 'zoom': 9.4, 'lat': 40.673790, 'lng': -73.992685
  • size: 920, 300
In [384]:
plot_size = (920, 300)

breaks = [400000,700000, 1000000, 1300000, 1600000]
size_bins_st = size_bins_style('revenue', breaks = breaks)
size_bins_leg = size_bins_legend(title='Revenue', description='STARBUCKS', footer='Size of the blob')

viewport = {'zoom': 9.4, 'lat':40.673790, 'lng': -73.992685}

Map([Layer(df2,style=color_bins_style('revenue_pred'), title='Predicted Revenue',
          widgets = histogram_widget('revenue_pred', title = 'Histogram of revenue_pred')),
     Layer( stores, size_bins_st, size_bins_leg, title='Revenue')], size=plot_size, viewport=viewport, show_info=True)
Out[384]:

[EX 29] Select 3 areas where to place a new Starbucks shop. Justify your answer.

In [385]:
print('To place a new Starbucks store, we have to take into consideration the closeness of other Starbucks stores and the Predicted Revenue of the census block, to expect higher benefits.')
print('According to these reasons, I would place Starbucks store in the area of: viewport = {\'zoom\': 9.94, \'lat\':40.858563, \'lng\': -73.596159}, where there are high predicted revenue values and no stores close. Plus the ones closer are on the top break level of revenue')
print('The second one, and following the same procedures, would be placed in the area of: viewport = {\'zoom\': 9.99, \'lat\':40.800242, \'lng\': -73.385038}. And a third one, could be on the end of the island or maybe a better option is within the area of viewport = {\'zoom\': 9.91, \'lat\':40.846007, \'lng\': -72.843293}, since there is a big area without Starbucks stores nearby and the revenue prediction value of the area is pretty high (dark).')
To place a new Starbucks store, we have to take into consideration the closeness of other Starbucks stores and the Predicted Revenue of the census block, to expect higher benefits.
According to these reasons, I would place Starbucks store in the area of: viewport = {'zoom': 9.94, 'lat':40.858563, 'lng': -73.596159}, where there are high predicted revenue values and no stores close. Plus the ones closer are on the top break level of revenue
The second one, and following the same procedures, would be placed in the area of: viewport = {'zoom': 9.99, 'lat':40.800242, 'lng': -73.385038}. And a third one, could be on the end of the island or maybe a better option is within the area of viewport = {'zoom': 9.91, 'lat':40.846007, 'lng': -72.843293}, since there is a big area without Starbucks stores nearby and the revenue prediction value of the area is pretty high (dark).

Upload to Carto to share

If we would like to upload to Carto to share the new visualization, we should apply the following script:

In [387]:
to_carto(df2,'starbucks_long_island_predictions', if_exists = 'replace')

Conclusions

  • Using CARTO we were able to explore and visualize the Starbucks data and enrich them using demographics and socioeconomic data from CARTO Data Observatory
  • We built a Linear Regression model to predict the stores annual revenue from demographic and socioeconomic covariates, which were first transformed to impute missing data and reduce the model dimensionality.